This data set contains 113,937 loans with 81 variables on each loan, including loan amount, borrower rate (or interest rate), current loan status, borrower income, and many others. The analysis is structured to move from simple univariate relationships to multivariate relationships. It addresses questions such as whether the monthly loan payment is correlated with the original loan amount, how the loan term is distributed across loan statuses, how frequently each value of the categorical variables (loan term, borrower's employment status, year of loan, and loan status) occurs, and whether individual loans differ depending on the size of the original loan amount.
The answers to these questions should produce insights that can be used in a presentation. Although the dataframe contains 81 features, this study is only interested in a handful of them, so the dataframe was reduced to the relevant columns, including original loan amount, loan origination date, monthly loan payment, days since the last payment was made, stated monthly income, investors, and recommendations. These features were collected into a new dataframe that serves as the reference for exploration and analysis.
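As a minimal sketch of that reduction, the snippet below keeps a subset of columns from a small stand-in frame. The rows, the `UnusedFeature` column, and the names `prosperloan_demo`, `columns_of_interest`, and `reduced_demo` are illustrative assumptions, not the notebook's actual wrangling code.

```python
import pandas as pd

# Tiny stand-in for the full 81-column dataframe (all values are made up)
prosperloan_demo = pd.DataFrame({
    'LoanOriginalAmount': [4000, 10000, 15000],
    'MonthlyLoanPayment': [173.71, 318.93, 563.97],
    'StatedMonthlyIncome': [3083.33, 6125.0, 2083.33],
    'Term': [36, 36, 60],
    'UnusedFeature': ['a', 'b', 'c'],  # hypothetical column the study does not need
})

# Keep only the columns of interest; .copy() avoids modifying the original frame later
columns_of_interest = ['LoanOriginalAmount', 'MonthlyLoanPayment',
                       'StatedMonthlyIncome', 'Term']
reduced_demo = prosperloan_demo[columns_of_interest].copy()

print(reduced_demo.shape)  # (3, 4)
```

Selecting by an explicit list of names keeps the reduced frame's column order predictable and makes the chosen features easy to audit.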
The loan status column contains several values that represent past-due loans in different ranges of days; these have been replaced with a single value, "past due", that applies regardless of how many days have passed. The stated monthly income and monthly loan payment variables were converted from float to integer for compatibility with the loan amount data type, and the borrower state values were expanded from state abbreviations to full text. The occupation column was changed from the object data type to a categorical data type.
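The transformations described above could be sketched as follows. The sample rows, the exact "Past Due (...)" labels, the three-entry `state_names` mapping, and the use of truncating `astype(int)` are assumptions made for illustration; the real cleaning step may differ.

```python
import pandas as pd

# Made-up rows that mimic the columns being cleaned
loans_demo = pd.DataFrame({
    'LoanStatus': ['Current', 'Past Due (1-15 days)', 'Past Due (>120 days)', 'Completed'],
    'BorrowerState': ['CA', 'TX', 'NY', 'CA'],
    'StatedMonthlyIncome': [3083.33, 6125.0, 2083.33, 2875.0],
    'MonthlyLoanPayment': [330.43, 318.93, 0.0, 123.32],
    'Occupation': ['Professional', 'Other', 'Teacher', 'Other'],
})

# Collapse every "Past Due (...)" variant into a single label
loans_demo['LoanStatus'] = loans_demo['LoanStatus'].str.replace(
    r'^Past Due.*', 'Past Due', regex=True)

# Expand state abbreviations to full names (hypothetical partial mapping)
state_names = {'CA': 'California', 'TX': 'Texas', 'NY': 'New York'}
loans_demo['BorrowerState'] = loans_demo['BorrowerState'].map(state_names)

# Float to integer; astype(int) truncates and assumes there are no missing values
for column in ['StatedMonthlyIncome', 'MonthlyLoanPayment']:
    loans_demo[column] = loans_demo[column].astype(int)

# Object to categorical
loans_demo['Occupation'] = loans_demo['Occupation'].astype('category')

print(loans_demo.dtypes)
```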
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
sns.set_palette("Set2", 8, .75)
%matplotlib inline
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
# load dataset in dataframe
prosperloan_reduced = pd.read_csv('prosperloan_reduced.csv')
# convert object to categorical datatype
prosperloan_reduced.Term = prosperloan_reduced.Term.astype('category')
prosperloan_reduced.Year = prosperloan_reduced.Year.astype('category')
prosperloan_reduced.LoanStatus = prosperloan_reduced.LoanStatus.astype('category')
prosperloan_reduced.BorrowerState = prosperloan_reduced.BorrowerState.astype('category')
prosperloan_reduced.Occupation = prosperloan_reduced.Occupation.astype('category')
prosperloan_reduced.EmploymentStatus = prosperloan_reduced.EmploymentStatus.astype('category')
# define a func to plot hist
def draw_hist(x, title):
    """Plot a histogram to show the distribution of a numeric variable.

    param: x (values to plot), title (plot title)
    return: None
    """
    plt.figure(figsize=(14,8), dpi = 400)
    plt.hist(x = x)
    plt.title(title)
    plt.xlabel('Amount (Dollars)', fontsize = 10)
    plt.ylabel('Frequency', fontsize = 10)
# calling the function to plot hist of loan original amount
draw_hist(prosperloan_reduced.LoanOriginalAmount, 'Histogram Distribution of Loan Original Amount.')
The monthly loan payment is also right-skewed, that is, asymmetric with a long tail to the right. Most of the monthly loan payments are clustered on the left side of the histogram. The peak of the monthly loan payment occurs at about 173 dollars, and the data range from about zero dollars to 2251 dollars.
# calling the function to plot hist of monthly loan payment
draw_hist(prosperloan_reduced.MonthlyLoanPayment, 'Histogram Distribution of Monthly Loan Payment.')
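A right skew can also be checked numerically rather than read off a histogram. The sketch below uses a made-up series, `payments_demo`, not the real column: a positive sample skewness and a mean above the median are both consistent with a long right tail.

```python
import pandas as pd

# Made-up payments: most are small and a few are large (a right-skewed shape)
payments_demo = pd.Series([90, 120, 150, 173, 173, 180, 200, 250, 400, 900, 2251])

skewness = payments_demo.skew()          # sample skewness; > 0 means a right tail
mean_payment = payments_demo.mean()
median_payment = payments_demo.median()

print(skewness > 0, mean_payment > median_payment)  # True True
```

Applied to the notebook's data, the same two checks would be run on `prosperloan_reduced.MonthlyLoanPayment`.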
# define a func to plot kernel density estimate
def draw_kde(x, title):
    """Plot a kernel density estimate of a column of prosperloan_reduced.

    param: x (column name), title (plot title)
    return: None
    """
    plt.figure(figsize=(14,8), dpi = 400)
    sns.kdeplot(x = x, data = prosperloan_reduced, fill = True)
    plt.xlabel('Amount (Dollars)')
    plt.title(title)
# plot kernel density estimate for loan original amount
draw_kde('LoanOriginalAmount', 'Kernel Density Estimate for Loan Original Amount.')
A kernel density estimate was plotted for the loan's original amount; it approximates the probability density function of the data points. Densities are helpful because they allow probabilities to be calculated. For example, the likelihood that a randomly chosen monthly loan payment will fall between $300 and $500 can be read from the plot below as the area between the density curve and the x-axis over the range [300, 500].
# plot kernel density estimate for monthly loan payment
draw_kde('MonthlyLoanPayment', 'Kernel Density Estimate for Monthly Loan Payment.')
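The area-under-the-curve idea can be made concrete with a small sketch. It assumes a Gaussian kernel with Scott's rule for the bandwidth and runs on synthetic, right-skewed payments; `kde_interval_probability` and `payments_demo` are hypothetical names introduced here. Because each sample contributes one Gaussian bump, the area over [low, high] is the average of the normal CDF differences.

```python
import random
from statistics import NormalDist, stdev

def kde_interval_probability(samples, low, high):
    """Area under a Gaussian KDE of `samples` between `low` and `high`."""
    n = len(samples)
    bandwidth = stdev(samples) * n ** (-1 / 5)  # Scott's rule (an assumption)
    std_normal = NormalDist()
    # Integrate each sample's Gaussian bump over [low, high], then average
    area = sum(
        std_normal.cdf((high - s) / bandwidth) - std_normal.cdf((low - s) / bandwidth)
        for s in samples
    )
    return area / n

# Synthetic right-skewed "payments"; these are not the Prosper data
rng = random.Random(0)
payments_demo = [rng.lognormvariate(5.5, 0.6) for _ in range(2000)]

prob_kde = kde_interval_probability(payments_demo, 300, 500)
prob_empirical = sum(300 <= p <= 500 for p in payments_demo) / len(payments_demo)
print(round(prob_kde, 3), round(prob_empirical, 3))
```

The KDE-based probability should sit close to the plain fraction of samples in the interval; the small gap comes from the smoothing introduced by the kernel.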
# a func to plot a univariate countplot
def univariate_count_plot(x, title):
    """Plot a countplot of a column of prosperloan_reduced.

    param: x (column name), title (plot title)
    return: None
    """
    plt.figure(figsize=(14,8), dpi = 400)
    ax = sns.countplot(x = x, data = prosperloan_reduced, color = 'blue')
    # annotate each bar with its count
    for p in ax.patches:
        ax.annotate('{:.0f}'.format(p.get_height()), (p.get_x(), p.get_height()+0.05))
    plt.suptitle(title)
    plt.xticks(rotation = 90)
    plt.show()
The count plot below was used to determine the frequency of the categorical variable loan term. It shows that medium-term loans, in this case 36 months, have the highest occurrence with a count of 87778, representing approximately 77 percent of all loans, with the remaining 23 percent distributed between long-term (60 months) and short-term (12 months) loans.
# calling the function to draw countplot of the feature term
univariate_count_plot('Term', 'Term of Loan Distribution.')
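The 77 percent figure follows directly from the counts quoted in the text. The sketch below redoes that arithmetic and then shows how `value_counts(normalize=True)` gives the share of every category at once; `term_demo` is a made-up column used only for illustration.

```python
import pandas as pd

# Figures quoted in the text: 87,778 of the 113,937 loans have a 36-month term
share_36_months = 87778 / 113937 * 100
print(round(share_36_months, 1))  # 77.0

# The same calculation for every category at once, on a made-up column
term_demo = pd.Series([36, 36, 36, 60, 12, 36, 60, 36])
term_share = term_demo.value_counts(normalize=True) * 100
print(term_share)
```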
The count plot below was used to identify the frequency of the categorical variable borrower's employment status. Borrowers who are employed have the highest occurrence in the employment status category with a count of 69557, while retired borrowers have the lowest occurrence, which suggests that loans are disbursed more often to working borrowers than to retired individuals.
# calling the function to draw countplot of the feature Employment Status
univariate_count_plot('EmploymentStatus', "Borrower's Employment Status Distribution.")
The count plot below was used to determine the frequency of the categorical variable year. The year 2013 had the highest number of loan disbursements with an occurrence of 34345, followed by 2012 and 2014 in second and third position respectively, while 2005 had the fewest disbursements with an occurrence of 22.
# calling the function to draw countplot of the feature Year
univariate_count_plot('Year', 'Loan Distribution by Year.')
# define a function to plot lineplot
def line_plot(x, y, title):
    """Plot a line plot of two columns of prosperloan_reduced.

    param: x (column name), y (column name), title (plot title)
    return: None
    """
    plt.figure(figsize=(14,8), dpi = 400)
    sns.lineplot(x = x, y = y, data = prosperloan_reduced)
    plt.xlabel('Loan Original Amount (Dollars)')
    plt.ylabel('Monthly Loan Payment (Dollars)')
    plt.title(title)
To establish whether the continuous numerical variables loan original amount and monthly loan payment are related, the line plot below was used. It shows a positive correlation between the two variables: as the original loan amount increases, the monthly loan payment increases as well.
# calling the function to draw a line plot of monthly loan payment against loan original amount
line_plot('LoanOriginalAmount', 'MonthlyLoanPayment', 'Line Graph Depicting Relationship Between Monthly Loan Payment and Loan Original Amount.')
# define func to plot scatter plot of monthly loan payment against loan original amount
def draw_scatter(hue, title):
    """Plot a scatterplot of monthly loan payment against loan original amount.

    param: hue (column name used to colour the points), title (plot title)
    return: None
    """
    plt.figure(figsize=(14,8), dpi = 400)
    sns.scatterplot(x = 'LoanOriginalAmount', y = 'MonthlyLoanPayment', hue = hue, data = prosperloan_reduced)
    plt.xlabel('Loan Original Amount (Dollars)')
    plt.ylabel('Monthly Loan Payment (Dollars)')
    plt.title(title)
Earlier findings established a positive relationship between the loan original amount and the monthly loan payment. The scatterplot below colours the data points by term of loan to illustrate the relationship between three variables: two continuous numerical variables (loan original amount and monthly loan payment) and one categorical variable (term).
# calling the function to plot scatter plot group by loan term
draw_scatter('Term', 'Monthly Loan Payment Against Loan Original Amount Grouped by Loan Term.')
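One plausible reason for points to separate by term is that, for a fixed-rate loan, the monthly payment is proportional to the principal with a slope that depends on the term. The sketch below uses the standard amortization formula as an assumed model; whether the payments in this data set were computed exactly this way is not confirmed by the notebook, and the 12 percent annual rate and the `monthly_payment` helper are made up for illustration.

```python
def monthly_payment(principal, annual_rate, term_months):
    """Fixed monthly payment under standard amortization (an assumed model)."""
    monthly_rate = annual_rate / 12
    if monthly_rate == 0:
        # With no interest the principal is simply split evenly
        return principal / term_months
    return principal * monthly_rate / (1 - (1 + monthly_rate) ** -term_months)

# Same principal and rate, three different terms
for term in (12, 36, 60):
    print(term, round(monthly_payment(10000, 0.12, term), 2))
```

Under this model, doubling the principal doubles the payment, while lengthening the term lowers it, which would produce one roughly straight band of points per term.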
The scatterplot below colours the data points by the year of the loan to illustrate the relationship between three variables: the loan original amount, the monthly loan payment, and the year. As before, the positive relationship between the loan original amount and the monthly loan payment is visible.
# calling the function to plot scatter plot group by year
draw_scatter('Year', 'Monthly Loan Payment Against Loan Original Amount Grouped by Loan Year.')
The figure below plots a heatmap of the correlation matrix, which summarises the linear relationships between the numerical variables. From the heatmap, we can see that there is a strong positive correlation between the original loan amount and the monthly loan payment, with a correlation coefficient of 0.93, while there appears to be no correlation between the stated monthly income and the original loan amount.
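# plot a heatmap to show correlation
plt.figure(figsize=(14,8), dpi = 400)
colormap = sns.color_palette('Greens')
# restrict to numeric columns; corr() can fail on categorical columns in recent pandas
numeric_columns = prosperloan_reduced.select_dtypes(include = 'number')
sns.heatmap(numeric_columns.corr(), annot = True, cmap = colormap, center = 0)
plt.title('Correlation Matrix Heatmap Depicting Relationships Between Variables.')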
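The numbers annotated on such a heatmap are the entries of a Pearson correlation matrix. The sketch below builds one on a small made-up frame, `corr_demo_frame`, and drops the categorical column first, since correlation is only defined for numeric columns; the values are illustrative and will not reproduce the 0.93 reported for the real data.

```python
import pandas as pd

# Made-up rows: payment rises with amount, income is unrelated
corr_demo_frame = pd.DataFrame({
    'LoanOriginalAmount': [2000, 4000, 10000, 15000, 25000],
    'MonthlyLoanPayment': [70, 140, 330, 500, 820],
    'StatedMonthlyIncome': [5000, 2500, 6000, 3000, 4500],
    'Term': pd.Series([36, 36, 36, 60, 60]).astype('category'),
})

# Keep numeric columns only, then compute pairwise Pearson correlations
numeric_demo = corr_demo_frame.select_dtypes(include='number')
corr_matrix = numeric_demo.corr()  # Pearson is the default method

print(corr_matrix.round(2))
```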
# creating a slide show
!jupyter nbconvert Alexander_Yirenkyi_Project3_Part_II.ipynb --to slides --post serve --no-input --no-prompt